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Abstract: Using the maximum entropy method, we derive the "adaptive clus- 
ter expansion" (ACE), which can be trained to estimate probability density 
functions in high dimensional spaces. The main advantage of ACE over other 
Bayesian networks is its ability to capture high order statistics after short train- 
ing times, which it achieves by making use of a hierarchical vector quantisation 
of the input data. We derive a scheme for representing the state of an ACE net- 
work as a "probability image", which allows us to identify statistically anomalous 
regions in an otherwise statistically homogeneous image, for instance. Finally, 
we present some probability images that we obtained after training ACE on 
some Brodatz texture images - these demonstrate the ability of ACE to detect 
subtle textural anomalies. 

1 Introduction 

The purpose of this paper is to train probabilistic network models of images of 
homogeneous textures for use in Bayesian decision making. In our past work in 
this area p] El El El E] we successfully used entropic methods to design Markov 
random field (MRF) models to reproduce the observed statistical properties 
of textured images. We now wish to formulate a novel MRF structure that 
requires much less effort to train and use. There are two essential ingredients in 
our simplification: we do not use hidden variables, and we restrict our attention 
to hierarchical transformations of the data. 

The use of hidden variables is a flexible way of modelling high order corre- 
lations in data but it leads to lengthy Monte Carlo simulations to estimate 
averages over the hidden variables. An MRF without hidden variables is speci- 
fied by a set of transformation functions, each of which extracts some statistic 
from the data, and together they provide sufficient information to compute the 
probability density function (PDF) of the data |3]Ej- 

We can obtain a wealth of statistical information about the data by restrict- 
ing our attention to a finite number of well-defined transformation functions. 

•This paper was submitted to IEEE Trans. PAMI on 2 May 1991. Paper PAMI No. 91- 
05-18. It was not accepted for publication, but it underpins several subsequently published 
papers. 



For instance, in a number of useful textural features are presented, which 
may be used to model and discriminate between various textures that occur in 
images. However, we wish to design our transformation functions adaptively 
in a data-driven manner, so that the resulting set is optimised to capture the 
statistical properties of the data. We choose to use adaptive hierarchical trans- 
formation functions, because these not only capture statistical properties at 
many length scales, but are also very easy to train. 

We briefly discussed hierarchical transformation functions in p|, where we 
conjectured that topographic mappings [9 J might be appropriate for connecting 
together the layers of the hierarchy. We investigated topographic mappings in 
|l()l 1111 1121 ll.'SI HI] and found that they could be rapidly trained to produce use- 
ful multiscale representations of data. We therefore use multilayer topographic 
mappings to adaptively design hierarchical transformation functions of data for 
use in MRF models. In this type of model different layers of the hierarchy 
measure statistical structure on different length scales, and shorter length scale 
structures are clustered together and correlated to produce longer length scale 
structures. We therefore frequently refer to this type of scheme as an adap- 
tive cluster expansion (ACE). By interpreting ACE as a multilayered n-tuple 
processor we can relate ACE to a multilayered version of WIS ARD fT[% ■ 

We demonstrate the ability of ACE to learn the statistical structure of tex- 
ture by training an adaptive pyramid image processor. There are many ways of 
displaying the statistical information extracted from the data by such a proces- 
sor, but we prefer to use what we call a "probability image", which is generated 
from the estimated local PDF of the data. 

The layout of this paper is as follows. In Section we use the maximum 
entropy method to estimate the PDF of the data, subject to a set of marginal 
probability constraints measured using hierarchical transformation functions, to 
yield an MRF model in closed form (i.e. no undetermined Lagrange multipliers). 
In Section we extend this result to remove some of the limitations of its 
hierarchical structure, such a translation non-invariance, and describe the ACE 
system for producing probability images. In Section 0] we present the result of 
applying ACE to some textured images taken from the Brodatz set 

2 Maximum entropy PDF estimation 

In this section we present a derivation of a hierachical maximum entropy esti- 
mate Q mem {x) of an observed true PDF P(x), where we constrain Q mem (x) so 
that certain marginal PDFs agree with observation. Although we consider only 
the case of a binary tree, we also present a simple diagrammatic representation 
of this result that allows us easily to extend it to general trees. 

2.1 Basic maximum entropy method 

For completeness, we first of all outline the basic principles of the max- 

imum entropy method of assigning estimates of PDFs. Introduce the entropy 
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functional H 

H=- [ d x Q{x)\og(^j\) (1) 
J Qo{x) 

in which the PDF Qo(x) is used to introduce prior knowledge about P(x). 
Loosely speaking, H measures the extent to which Q(x) is non-committal about 
the value that x might take. The maximum entropy method consists of max- 
imising H subject to the following set of constraints 

C M = / dxQ{x)yi{x) - J dxP(x)yi(x) , . 

= [Z) 

where the Vi{x) are the components of a vector y(x) of sampling functions. 
These constraints ensure that certain average values are the same whether they 
are measured using Q(x) (i.e. our estimated PDF) or using P(x) (i.e. the ob- 
served true PDF). By carefully selecting the y(x) we can optimise the agreement 
between Q(x) and P(x) as appropriate. 

Qmem(^) may be found by introducing a vector A of Lagrange multipliers, 
and functionally differentiating H — J2i with respect to Q(x) to yield 

eventually 

n / \ QoQg) exp(-A.j/(x)) 

QjntmW - j dx , Qo{xl) exp( _ A . 2/(x0) W 

The undetermined Lagrange vector A must be chosen in such a way that the 
constraints are satisfied - this is usually a non-trivial problem. 

Now we shall consider a special case of the maximum entropy problem in 
which we carefully design the y%{x) so that they constrain a set of marginal 
probabilities 0. Thus we make the following replacements 

Vi(x) — >8{y-y{x)) 

A, — \{y) {i} 

where 8{y — y{x)) is a Dirac delta function. In the {j/j(x), Aj} version of the max- 
imum entropy problem, by varying the value of an index i we could scan through 
the set of constraint functions yi(x) and Lagrange multipliers Ai. However, in 
the {S(y — y{x)), A(y)} version of the maximum entropy problem, by varying 
the value of a variable y we can scan through the set of constraint functions 
S(y — y(x)) and Lagrange multipliers X(y). 

The modification in Equation 0] causes the constraints in Equation to 
become 

C 2 (y) = J dxQ(x)S(y - y(x)) - J dxP(x)S(y - y{x)) 

= Q(y) - P{y) (5) 

= 

where we have defined the PDFs over y as 

Q(y) = 1 dxQ(x)S(y - y(x)) 

P{y) = J dxP(x)8(y - y(x)) 1 > 
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Thus the delta function constraints force Q(y) = P(y)- Note that we have 
used a rather loose notation for our PDFs - P(x) and P(y) are in fact different 
functions of their respective arguments. We have made this choice of notation 
for simplicity, because the context will always indicate unambiguously which 
PDF is required. 

By analogy with the previous maximum entropy derivation, Q mem (j:) may 
be found by functionally differentiating H — j dy\(y)C2(y) with respect to Q(x) 
to yield 

n (rr \ _ Qoi x ) e*p(-A(2/(s))) (7 , 

VmemW - j ^Q^,) ex p(-A(y (*'))) ^ 

— Q (x)f(y(x)) (8) 

where X(y(x)) is an undetermined Lagrange function of y{x). In Equation 0D 
we present a simpler notation by introducing an undetermined function f(y(x)) 
to absorb the exponential function and the denominator term that appeared 
in Equation We may impose the constraints in Equation El and use the 
definitions of Q(y) and P(y) in Equation to obtain f(y) in the form 

fly) = EM (9) 

J(V) Jdx'Qo(x')6(y-y(x')) W 
and Qmem (a 7 ) m the form 

n ( \ - Qa( x ) p (y( x )) / 1n -> 

Qmem[ ) ~ J dx'Qo(x')S(y(x) y(x' j) (W) 

Note that this result is a closed form solution because it contains no undeter- 
mined Lagrange functions, unlike Equation which contains an undetermined 
Lagrange vector A. The normalisation of this solution can be verified as follows 

JdxQ mem (x) = J dxdyS(y - y(x)) { dxl Q°^yl y[x ,)) 

= fdyP(y) (11) 
= 1 

where we use the identity J dy6(y — y{x)) = 1 to create a dummy integral over 



2.2 Hierarchical maximum entropy method 

The purpose of this subsection is to present a generalisation of Eoua.tionlToltha.t 
uses hierarchical transformation functions. 

In practice the result in Equation El has a limited usefulness. Firstly, we 
would like to impose many simultaneous constraints, each using its own con- 
straint function S(yt — yi(x)) in Equation 03 but this cannot in general be done 
without sacrificing our closed form solution in Equation Secondly, we would 
like to impose higher order constraints, using a constraint function S(y — y{x)). 
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This may easily be done by making the replacement y — ► (y% , y-x , • • • ) in Equa- 
tionE3 However, there is a hidden problem, because the greater the dimension- 
ality of y, the less easy is it to make the necessary measurements to establish 
the form of P(y). Fortunately, there is a solution to both of these problems, 
which we shall describe below. 
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Figure 1: Notation used in the hierarchical maximum entropy derivation. 

We shall apply the maximum entropy method with constraints of the form 
shown in Equation to a hierarchy of transformed versions of the input vec- 
tor x. In order to make our calculation tractable we introduce the notation 
shown in Figure^] The Xijk... are various partitions of the input vector x, the 
yijk... are various transformed versions yijk...( x ijk...) of the input x^k..., and 
the fijk- ,i'j>k'- are the Lagrange functions fijk- ,i'j'k'-(yijk-,yi'j'k'-) that 
appear in the generalised version of the maximum entropy solution Q mem (x) in 
Equation 

We choose to write the dependence of yijk... directly on the input Xijk... , even 
though the value of yijk... is obtained via a number of intermediate transforma- 
tions leading from the leaf nodes of the tree up to node ijk..., because this leads 
to a transparent hierarchical maximum entropy derivation. It is convenient to 
define Hijk---{ x ijk---) as the product of the Lagrange functions that appear be- 
neath node ijk... of the tree. Uijk—(xijk—) has the following recursion property 



ILjfc... (Xijk...) — fijk—l,ijk--2(yijk---l(Xijk--l)>yijk—2(Xijk---2)) 
X Hijk—l(Xijk—i)^lijk—2(Xijk—2) 

Also introduce a normalisation (or Jacobian) factor defined as 



(12) 



Z ij k...( yi jk...) = jdx i jk...6 {yi jk...-y^^ (13) 

which is a sum of LTyfc... (x^k—) over all the states Xijk— of the leaf nodes beneath 
node ijk... that are consistent with yijk... emerging at node ijk.... 

The proof of the general hierarchical maximum entropy result proceeds in- 
ductively. Firstly, we generalise Equation Q] to become 

y%{x) — > S (yijk-i - yijk-x(xijk-i)) S(yijk-2 - y%jk-2{xijk-2)) 

-V ► ^ijk—(yijk—l,yijk — 2) 
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Secondly, we generalise Equation |H| to become 

Qmem(x) = Qo(x)fl, 2 (yx (xi), 2/2 (X2))U 1 (xi)IL 2 (x 2 ) (15) 

where we display the Lagrange function /i,2(2/i; 2/2) that connects the topmost 
node-pair (i.e. node-pair (1,2)) in the tree, but conceal the other Lagrange 
functions by using the II ijfc ... notation. 

We may determine the exact form of /i, 2(2/1, V2) independently of the rest of 
the Lagrange functions (which are hidden inside the Hi(xi) and n 2 (x 2 ) func- 
tions) by imposing the constraint shown in Equation and Equation U3 (as 
applied to node-pair (1,2)) to obtain 



^1,2(2/1,2/2) =/ dx 1 dx 2 S(y 1 - y 1 (x 1 ))S(y 2 - y2(x2))Qmem(x) 
= hAvi, V2)Zx (yi)Z 2 (y 2 ) 



(16) 



which yields 

f l \ fl^C?/!, 2/2) nr? s 

/l,2(2/l,2/2) = r? , s„ , x (17) 

Zi (2/1)^2(2/2) 

Substituting this result back into Equation 1151 yields 

Qtnem(^) = ^1,2 2/1 fol ), 2/2^ 2 ) , , r r- . . 18 

Zi(yi(xi)) ^2(2/2(^2)) 

which correctly obeys the constraint on the joint PDF Pi, 2(2/1, 2/2) of the topmost 
pair of nodes in the tree. 

We now marginalise Q mei n{x) in order to concentrate our attention on the 
left-hand main branch of the tree. Thus 

<3l,mem(zi) = / dx 2 Q mem (x) 

= J dx 2 dy 2 6(y 2 - y 2 (x 2 ))Q meta (x) (19) 

We now use the recursion property given in EquationEHo extract the Lagrange 
function associated with node-pair (11,12). Thus Qi, mem {xi) becomes 

Qi,mem(xi) = Pi(yi{xi))fii } i 2 (yu(xn),yi 2 (xi 2 )) ^^^^^- (20) 

^1(2/1(^1)) 

As before, we may determine the exact form of /n, 12(2/11, 2/12) independently 
of the rest of the Lagrange functions by applying the constraints to node-pair 
(11,12) to obtain 

- / \ P lh 12(2/11,2/12) ^l(2/l) ,„,x 
711,12(2/11,2/12) = 5-7 r ^—7 T7y 7 7 (21) 

P\\Vx) ^11(2/11)^12(2/12) 

where the value of 2/1 is to be understood to be obtained directly from the 
values of j/11 and y% 2 via the mapping which connects node-pair (11, 12) to node 
1. Substituting this result into Equation 1201 yields 

n f \ v ! ! \ 1 \\ Hii^n) ni 2 (xi 2 ) , 
Qx t mem[Xi) = Pll,l 2 {yxi{xix),yi 2 {Xx 2 ))-=— \T "7^ 7 — 7 — rr (22) 

^11(2/11(^11)) ^12(2/12(^12)) 
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By inspection, we see that Equation ^| and Equation are identical in form 
once we have accounted for their different positions in the tree, so we may use 
induction to obtain all of the rest of the Lagrange functions in the form 



fijk—l,ijk—2 {Vijk--lt Vijk—z) 



Pijk--\,ijk--2{yijk--\,yijk--2) 
Pijk-iVijk-) 

x Z ijk ...(y ijk ...) 

Zijk-\(yijk-\)Zijk-2{yijk--2) 



(23) 



which is analogous to Equation 1211 and where y^... is obtained directly from 
the values of yijk—i and yijk—i- The § factors may be discarded once we reach 
the leaf nodes of the tree, because the integral in Equation El then reduces to 
Z = 11. 

Finally, by starting with Equation El and recursively simplifying the LTjjk... 
using Equation El an d substituting for the Lagrange functions fijk-- ,i'j'k'-- 
using Equation E] we obtain eventually for an n-layer tree 



>(*) 



n-2 2 

n n 

2 

n 



i 1 i 2 - -i fc l,il »j ■ 



2(yi X i 



2{xi 1 i 



Pi, 



in ( 2 '«l«2---«7i) 



(24) 

where we have rearranged the terms to collect together the factors that each 
node-pair (iii2 ■ ■ ■ ijfel, HI2 • ■ ■ ife2) contributes. 

Although we have concentrated on deriving Q mem (a:) for a binary tree, the 
principle of the derivation carries over unchanged to arbitrary tree structures, 
and Equation El may easily be generalised. In Appendix B we explain the 
relationship of the single layer version of Equation E] to the random access 
memory network that is known as WISARD 



2.3 Diagrammatic notation 




Figure 2: The individual steps of the inductive hierarchical maximum entropy 
derivation. 

We now present the steps in the inductive derivation leading from Equation 
n~8l to Equation [22] as a diagram in Figure EJ We use a triangle to represent 
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a subtree, and we indicate its apex node, its associated II or factor, and 
its dependence on x. Figure EK represents Equation which is a pair of 
trees connected by the joint PDF of their apex nodes. By integrating over 
x 2 we remove the right hand tree to obtain Figure Eb, which corresponds to 
Equation^!! We then explicitly display the two daughter nodes to obtain Figure 
Et, which corresponds to Equation EDI although we have grouped the terms 
together slightly differently, for simplicity. This exposes one of the Lagrange 
functions which we determine explicitly to obtain FigureEJi, which corresponds 
to Equation E21 One cycle of the inductive proof is completed by noting the 
correspondence between Figure EK and Figure EH- 




Figure 3: A diagrammatic representation of the hierarchical maximum entropy 
result. 

We represent Equation|21in diagrammatic form in Figure^ The tree struc- 
ture represents the flow of the transformations of the original input data x. 
Each square cornered rectangle represents the marginal PDF of the enclosed 
node-pair (i.e. one Pi x i 2 ---i n term from the second factor in Equation EH . Each 
round cornered rectangle represents the normalised marginal PDF of the encosed 
node-pair (i.e. one p ' 1 ' 2 '" ,fcl p 1 ' 2 '"' fc2 term from the first factor in Equation EH) - 
Overall, we obtain Equation [M| as the product of the rectangles in Figure y| 

This notation makes it easy to generalise the result in Equation |^1 in a 
purely diagrammatic fashion, by firstly constructing an arbitrary (i.e. not nec- 
essarily binary) tree-like transformation of the input data, and secondly using 
as maximum entropy constraints the marginal PDF of each set of sister nodes 
in the tree. This prescription permits many possible ACE structures, including 
those in which different constraints effectively operate between different layers 
of the hierarchy (by mapping one or more node values directly from layer to 
layer) . 

Each rectangle representing a marginal PDF in Figure contributes to the 
maximum entropy estimate of the PDF of a cluster of nodes in the input data. 
Because of the tree structure, clusters at each length scale are built out of 
clusters at smaller length scales. Equation|2]tells us exactly how to incorporate 
into Qmem(^) any additional statistical properties that might be observed when 
forming larger clusters out of smaller clusters in this way. 

Finally, Figure suggests an informal derivation of Equation Thus the 
expression for the maximum entropy estimate of the joint PDF of the input data 
x in Equation ^] can be viewed as the joint PDF of the pair of nodes at the 
top of the tree in Figure times corrective Jacobian factors that compensate 
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for effects of the many-to-one mapping that the input data undergoes before it 
reaches the top of the tree. The final maximum entropy expression in Equation 
1241 merely enumerates these corrective Jacobian factors explicitly in terms of 
marginal PDFs measured at various levels of the tree. This makes it clear 
that the maximum entropy method gives a result that is consistent with simple 
counting arguments, which could therefore be used in place of the rather involved 
maximum entropy derivation. 



3 Implementation of an anomaly detector 

Henceforth we shall refer to our hierarchical maximum entropy method as an 
adaptive cluster expansion (ACE) . In this section we describe how to implement 
Equation El in software. We assume that the ACE transformation functions 
have already been optimised using the unsupervised network training algorithm 
that we describe in Appendix A and in J^j, so the purpose of this section is 
to explain how to manipulate Equation |21 into a form that produces a useful 
output from the network. For concreteness, we produce an output in the form of 
an image that represents the degree to which each local patch of an input data 
is statistically anomalous, when compared to the global statistical properties of 
the input data. 

3.1 Two-dimensional array of inputs 




Figure 4: ACE connectivity for processing a 2-dimensional array of inputs. 

In Section we represented ACE as if it were operating on a 1-dimensional 
arrays of inputs (e.g. time series). In practice this might indeed be the case, but 
in this paper we choose to study 2-dimensional arrays of inputs (e.g. images). 
There is no difficulty in applying ACE to an image, provided that we appro- 
priately assign the leaf nodes to pixels of the image. In Figure 0] we show the 
simplest possibility in which the image is alternately compressed in the north- 
south and east-west directions. A priori, the choice of whether to start with 
north-south or east-west compression is arbitrary, but if we knew, for instance, 
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that the image had stronger short range correlations in the east-west direction 
than the north-south direction, then it would be better to compress east-west 
first of all. Note that in Figure0]the topology of the tree is the same as in Figure 
03 but the way in which the leaf nodes are identified with the data samples is 
different. 

More generally, we could identify the leaf nodes of the tree with the image 
pixels in any way that we please, provided that no pixel is used more than 
once (to guarantee that the tree-like topology is preserved). The problem of 
optimising the identification of leaf nodes with pixels is extremely complicated, 
so we shall not pursue it in this paper. 

3.2 Histograms 

The maximum entropy PDF in Equation[2|is a product of (normalised) marginal 
PDFs. In a practical implementation of ACE the yijk— are discrete- valued quan- 
tities (for instance, integers in the interval [0, 255]), and the Pijk--- ,i'j'k'---{Vijk---, Vi'j'k'--) 
are probabilities (not PDFs). We estimate the Pijk— ,i'fk'—(yijk—, Vi'j'k'— ) by 
constructing 2-dimensional histograms 

Pijk- ■■ .%' j' k f ■■■ {yijk- ■■ s Vi' j' k' ■■ ■) — "T7 hijk- - - ,i' j' k' - - - (yijk- - - j Ui ! j' k' ■ ■ ■) (25) 

l*ijk-- ,i'j'k'-- 

where hijk... ,i' j'k'-- -(y ijk- ■-, Vi'j'k 1 ■■■) is the number of counts in the histogram bin 
(yijk---, Vi'j'k'—), and N is the total number of histogram counts given by 

Nijk--- .i'j'k'--- = ^ ^ hijk— ,i'j'k'—(Vijk—, Vi'j'k' — ) (26) 
Vijk- Vi'j'k'... 

Note that the estimate in Equation |2£] suffers from Poisson noise due to the 
finite number of counts in each histogram bin. 

In order to build up this estimate we first of all train the ACE transformation 
functions as explained in Appendix A. The histogram bins are then initialised to 
zero, and subsequently filled with counts by exposing the trained ACE to many 
examples of input vectors (possibly, the set used to train the transformation 
functions). Thus each vector is propagated up through the ACE-tree, and we 
then inspect each node-pair {ijk ■ ■ ■ , i'j'k' • • • ) for which a marginal probability 
needs to be estimated, and increment its corresponding histogram bin thus 

hijk-- ,i'j'k'--{Vijk--, Vi'j'k' ■ ■■) ► hijk-- - ,i'j'k'--{Vijk--- , Vi'j'k'--) + 1 (27) 

When the training set has been exhausted, histogram bin (ijk • • • , i'j'k' • • • ) 
records the number of times that state {yijk— ■, Vi'j'k'--) occurred. 

A major disadvantage of using histograms is that they have a large number 
of adjustable parameters (i.e. the number of counts in each bin) that have 
to be determined by the training data, so they do not generalise very well. 
However, for the purpose of this paper, we do not need to resort to using more 
sophisticated ways of estimating PDFs. 
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3.3 Translation invariant processing 



We wish to detect statistical anomalies in images which have otherwise spa- 
tially homogeneous statistics, such as textures. An invariance of the statistical 
properties of the true PDF P(x) can be expressed as 



P{Qx) = P(x) 



(28) 



where Q is any element of the invariance group, which we shall assume to be 
the group of translations of the image pixels. In Equation [21 Q-mnmix) does not 
respect translation invariance for two reasons. Firstly, we use transformations 
l/ijk--(Xijk---) that are explicitly translation variant, because the functional form 
depends on the ijk- ■ ■ indices. Secondly, we connect together these transfor- 
mations in translation variant way, because the tree structure in Figure Hand 
FigureQ]does not treat all of its leaf nodes equivalently. We shall therefore mod- 
ify the cluster expansion procedure that we derived in Section l2~2*l to guarantee 
translation invariance. This will lead to a much improved maximum entropy 
estimate Q me tn(%) of the true P(x). 

Firstly, use the same transformation function at each position within a single 
layer of ACE. Thus in Equation [^1 we make the replacement 

l{Xi 1 i 2 ---i k 2) * V {Xi 1 i 2 - 



■ikl) 
■■ikV 



(29) 



where we indicate that the transformation is associated with the fc-th layer of 
ACE by attaching a superscript k to each function. This yields 



n-2 2 

n n 

k=0 ,ifc=l 



Pi 1 i 2 ...i k i,i 1 i 2 ...i k 2(y k (x lll2 ... lk i),y k (x ni2 ... ik2 )) 
Pi 1 i 2 -i k i(v k (xi 1 i 2 .^ k i))Pi^ 2 -.i k 2(v k (x l ^ 2 ..^ k2 )) 



2 

n Pi±i 2 - i 



„ {Xiii 2 ---i n ) 



(30) 

EquationEDlguarantees translation invariance (in the sense of a "single-instruction- 
multiple-data" computer) of the processing that occurs when the input data is 
propagated upwards through the overlapping trees. 

Secondly, assume that Equation |2H1 holds for all image translations, so that 
the marginal PDFs are independent of position. We may make this explicit in 
our notation by making the following replacement in Equation EQ1 



l'2--'fc 1 .'l'2--'fc 



2(0 



..- -» fc l(-)-Pil>2---ifc2(-) 

Piii 2 — in (') 



Pf,2(;) 
P k (-)P k (-) 
P«-l(.) 



(31) 



where we use the same superscript notation as in Equation |22l This yields 

P£2(y k (xi 1 i 2 -i k l),V k (,Xi 1 i2-i k 2)) 



n-2 2 

n 11 P, k (y k (xi,i,... il , 1 ))Pf(y k (x i ,i 

fc=0ii,ia,— ,»*=! 1 12 k 2 1 



n p- 

tl,t2r" , l k = l 



(32) 
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Equation 021 guarantees not only translation invariance of the transformations 
that propagate the data through the tree, but also translation invariance of the 
marginal PDFs of P(x) that are used to construct Q me m(x)- 

Both of the simplifications in Equation 1221 and Equation 0H1 reduce the total 
number of unknowns that have to be determined. For a given amount of training 
data we can thus construct a better maximum entropy estimate Q m em {%) of the 
true P(x). The transformation functions may be optimised better, and the 
histogram bins have a reduced Poisson noise. 

We usually apply ACE to such large input arrays that it is not appropriate 
to build a single binary tree whose leaf nodes encompass the entire input array. 
Instead, we divide the input array (which we shall assume is a 2 M x 2 M array 
of image pixels) into a set of contiguous 2™ 1 x 2™ 2 arrays, each of which we 
analyse using Equation 1221 There are no constraint functions to measure the 
mutual dependencies between these subarrays, so the maximum entropy joint 
PDF of the set of subarrays is a product of terms of the form shown in Equation 

on 



- 2 2 1 2 2 2 »Uy\<l£U k i),v\<X*.^)) 



l°g(Qmem(a;)) ~ E E E E lo g( pk ( k , x «i A 2 )) P kl kflA^z )J 

( 2 U-m 1 2 M-m 2 2 \ 

EE E tesiP^KCO)) 
ai = l Q2 = l ii,i2,— !*fc=l / 

(33) 

The summation over (a\,a-z) ranges over the 2 2M mi ™ 2 contiguous subarrays 
in the overall 2 M x 2 M array, and the ai,a 2 superscript on each x^... vector 
indicates that it belongs to subarray {a\,a-i). Note that we have transformed 
QmemM — > log(Q m em (x)) for convenience. 

The final step in constructing a fully translation invariant PDF is to modify 
the sum over subarrays so that it includes all possible placements of the 2 mi x 
2™ 2 subarray within the overall 2 M x 2 M array. There are 2 2Af - mi " m2 possible 
positions when the placement of the subarray is restricted as in Equation 03 
whereas there are (2 M — 2 mi + 1)(2 M — 2™ 2 + 1) possible positions when all 
placements of the subarray are permitted. We therefore make the replacement 



2 M- mi 2 M-m 2 2M -m -m 2 2 M -2 ra l +1 2 M -2"*H1 

E E (') * (2 M -2 m i+l)(2 M -2 m 2+l) E E (') 

Ol = l 02 = 1 Pl=l P2=l 

^ 2~ mi— ma ^2 (•) 

Pl = l P2=l 



(34) 



in Equation OSE where (pi,j»2) is the coordinate of the pixel in the top left hand 
corner of the 2 mi x 2™ 2 subarray. If we ignore edge effects, then we may use the 
approximation in the final line of Equation 031 which is the average of 2™ 1+ " 12 
separate contributions of the form shown in EquationOSl Equation|3H effectively 
replaces the original maximum entropy PDF Q mem (x) by the geometric mean of 
a set of maximum entropy PDFs. This averaging reduces the problems caused 
by Poisson noise on the histogram bin contents to yield a greatly improved 
maximum entropy PDF estimate. 
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Figure 5: Connectivity for multiple overlapping binary trees. 



In practice, we would implement each layer of ACE as a frame store, and 
the transformation between each pair of adjacent layers as a look-up table. The 
translation invariant ACE that we derived in EquationEHl(with the replacement 
given in Equation l3lj) may be implemented using the connectivity shown in 
Figure Ignoring edge effects, we may write Equation EH symbolically as 

n-2 pk 1 

log(Q mem (x)) ~ £ — ]>> g (^_) + -^log(p-i) (35) 

fe=0 1 2 

where the inner summations range over all positions within a single layer of 
Figure We omit all of the functional dependencies, because they are easy 

pk 

to obtain from Figure El Each term is represented by a rectangle with 

rounded corners in FigureEl and each P n ~ l term is represented by a rectangle 
with square corners in Figure We have not drawn these rectangles in Figure 
El because they would overlap, and thus confuse the diagram. 

3.4 Forming a probability image 

Equation EH is the fundamental result that we use to construct useful image 
processing schemes. However, it would not be very useful simply to calculate 
the value of log(Q mem ) as a single global measure of the logarithmic probability 
associated with an image. We choose instead to break up Equation E3 into 
smaller pieces, and to examine their contribution to the overall log(Q mem ). In 
effect, we look at how log(Q mem ) is built up from the information in each layer 
of ACE, which in turn we break down into contributions from different areas of 
the image. 

In order to ensure that our decomposition of log(Q mem ) can be easily com- 
puted, we use the backpropagation scheme shown in FigureEJto control the data 
flow through a translation invariant network of an identical connectivity to the 
one shown in Figure El Each node of this backpropagation network records a 
logarithmic probability, and is cleared to zero before starting the backpropa- 
gation computations. The rectangles in Figure El represent exactly the same 
logarithmic probability terms that appeared in Figure El which we now use as 
sources of logarithmic probability that we inject into the backpropagating data 
flow. 
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Figure 6: Backpropagation scheme for constructing a probability image. 



The detailed operation of Equation [301 is as follows. Each addition symbol 
takes as input a contribution recorded at a node in the next layer above, adds 

pk 

its own logarithmic probability source logf gras )< scales the result by 7. and 
it finally adds a copy of this result to the value stored at each of its own pair 
of associated nodes, as shown. The values that accumulate at the leaf nodes 
represent various contributions to the sum in Equation EiU If the translation 
invariant version of Figure E] is applied to the translation invariant network 
shown in Figure El then the sum of the values that accumulate at the leaf nodes 
reproduces Equation EID precisely. 

This method of computing log(Q mem ) might seem to be circuitous, but it 
has the great advantage of both being computationally cheap and forming an 
image-like representation of log(Q m em), which we call a "probability image". 

p k . 
Each log( „kpk ) term in Equation EH will contribute equally to 2 n ~ k pixels in 

the probability image. These pixels will be arranged as either a square or a 

2-to-l aspect ratio rectangle according to whether there is an odd number or 

even number of backpropagation steps from the k-th layer to the leaf nodes. The 

probability image is therefore a superposition of square and rectangular tiles of 

logarithmic probability. Each tile corresponds to a node of the network shown 

in Figure El 

It is useful to display as an image the contributions of a single layer of 
the network to the probability image, because different layers contribute to the 
structure of log(Q mem ) at different length scales. This image may be displayed 
in the conventional way, with small probabilities mapped to black, large prob- 
abilities mapped to white, and intervening probabilities mapped to shades of 
grey, in which case we call it a "probability image". It is also useful to invert 
the grey scale so that small probabilities map to black, in which case we call 
it an "anomaly image", because regions which have statistical properties that 
occur infrequently show up as bright peaks in the image. We find that the use 
of probability images and/or anomaly images is an extremely effective way of 
visually interpreting log(Q mem ) in Equation 1351 
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3.5 Modular implementation 




Figure 7: Three layer translation invariant ACE system. 

For completeness we now present a brief description of a complete system 
for producing probability and/or anomaly images. This system consists of two 
tightly coupled subsystems - an ACE subsystem for decomposing the image 
data, and a probability image subsystem for forming the output image. Figure 
[3 combines in one diagram all of the results that we have discussed so far. The 
upper part of Figure is a pure translation invariant ACE subsystem, whereas 
the lower half is a backpropagating probability image subsystem operating as 
shown in Figure EJ The backpropagating subsystem takes input information 
from various layers of ACE, as shown. Modules "I" are framestores that record 
the various transformed images. Modules "M" are look-up tables that record the 
inter-layer mappings. Modules "T" represent the training algorithm that we ex- 
plain in Appendix A, which we enclose in a dashed box because the "T" modules 
are switched out of the circuit once the mappings "M" have been determined. 
Modules "H" are accumulators that record the 2-dimensional histograms, and 
then regularise and normalise them appropriately. Modules "P" are framestores 
that record the various backpropagated probability images. Modules "log" are 
look-up tables (in fact only one such table is needed) that implement a log- 
arithm function. Modules "© " and "® " perform the addition and scaling 
operations that we discussed earlier in connection with Figure El "N" is scaling 
factor (which is j if we wish to reproduce the result in Equation |2HJ . The lines 
that are annotated "G" represent a ganging together of the (pointers to) pixels in 
adjacent layers of the ACE subsystem and in the probability image subsystem. 
These ensure that the entire system works in lockstep, as required. 

The simplest mode of operation of this system can be broken down into three 
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stages Firstly, train each layer (from left to right) of the ACE subsystem on a 
training image. Secondly, propagate a test image (from left to right) through 
the layers of ACE. Finally, construct a probability image by backpropagating 
(from right to left) contributions from the various layers of ACE. Furthermore, it 
is useful to display separately the probability (or anomaly) images that emerge 
from each layer of ACE, as we shall see in Section 0J 

There is a variety of methods of optimising "T", and hence "M". The method 
that we describe in Appendix A trains each layer in sequence, which takes 2.3 
second per layer (using a VAXstation 3100, and assuming 6 bits per pixel), 
which gives a full training time of 20 seconds for the 8 layer network that we 
use in our numerical simulations. We do not make use of more sophisticated 
schemes in which different layers are simultaneously trained, whilst communi- 
cating information with each other to improve the global performance of ACE. 

3.6 Relationship to co-occurrence matrix methods 

Both the basic maximum entropy PDF Q mem (x) in Equation^! and the trans- 
lation invariant version of log(Q mem (x)) in Equation EH that we implement 
in practice, depend on various PDFs that are measured in an ACE-tree. The 
second term of Equation ESI niay be written as 



Each P™ -1 factor is the spatial average of the marginal PDF of pairs of adjacent 
pixel values, assuming that we use the identification of leaf nodes with pixels 
that we show in Figure 0J The square root in Equation 1361 compensates for the 
fact that the product of P n_1 factors generates the product of two maximum 
entropy PDFs shifted by one pixel relative to each other. 

By using Equation [2^1 we may approximate Equation EEI as a product of his- 
tograms. In this case each histogram is the spatial average of the co-occurrence 
matrix of pairs of adjacent pixel values, as commonly used in image processing 
(Zj. Thus we may use conventional co-occurrence matrix methods to construct 
a simple form of maximum entropy PDF, which corresponds to using only one 
layer of ACE. 

This co-occurrence matrix result can be generalised, using Equation l2~TI or 
Equation|33 to model higher order statistical behaviour. Although these results 
depend on co-occurrence matrices measured at various places in the ACE-tree, 
the contributions which do not depend directly on the input data (i.e. the first 
term of Equation 13511 actually model higher order statistics of the input data. 
This is because the value yijk— that emerges from node ijk- ■ ■ of the ACE- 
tree depends on ccyfc..., so the joint PDF Pijk-\,ijk-2{yijk-\,yijk-2) depends 
on the statistics of the pair (£yfc...i> Xijk—2)- Thus ACE is a very convenient 
way of combining together the various orders of statistical information that are 
contained in co-occurrence matrices at various places in the ACE-tree, as shown 
in Figure El 




(36) 
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4 Numerical results 



In this section we explain the finer details of how to implement Figure in 
software, and we present the results of applying the system to four 256 x 256 
images of textiles taken from the Brodatz texture set ^H] . 

4.1 Experimental procedure 

We compensated for some of the effects of non-uniform illumination by adding 
to each image a grey scale wedge whose gradient was chosen in such a way as to 
remove the linear component of the non-uniformity. Not only does this improve 
the translation invariance of the image statistics, but it also improves the quality 
of the hierarchical coding of the image, because we reduce the need to develop 
redundant codes which differ only in their overall grey level. 

Throughout our experiments we generate optimal inter-layer mappings us- 
ing the training methods that we explain in Appendix A. These are known 
as topographic mappings in the neural network literature, and we showed in 
[19] why they are appropriate for building multistage vector quantisers. We 
choose to compress the image in alternate directions using the following se- 
quence: north/south, east/west, north/south, east/west, etc. This compression 
sequence leads to the following sequence of rectangular image regions that in- 
fluence the state of each pixel in each stage of ACE: 1 x 2, 2 x 2, 2 x 4, 4 x 4, 
etc, using (east/west, north/south) coordinates. In all of our experiments we 
use an 8 stage ACE. 

The number of bits per pixel that we use in each layer of ACE determines 
the quality of the hierarchical vector quantisation that emerges. Increasing the 
number of bits improves the quality of the vector quantisation but increases the 
training time: we need to compromise between these two conflicting require- 
ments. In our work on simple Brodatz texture images we have found that 6-8 
bits per pixel is sufficient. 

It is important to note that for a given number of bits (after compression) 
there is an upper limit on the allowed entropy that the input data can have. 
This problem becomes more severe the greater the data compression factor (i.e. 
the further we progress through the layers of ACE). For instance, if the input 
image is very noisy then 6-8 bits will be sufficient only to give good vector 
quantisation performance in the first few layers of ACE. This problem arises 
because ACE does not have much prior knowledge of the statistical properties 
of the input data, so each node of ACE encodes its input without assuming a 
prior model. A prior model would allow us to reduce the bit rate. This is a 
fundamental limitation to the capabilities of the current version of ACE. 

The choice of the size of the 2-dimensional histogram bins is also important. 
A property of the topographic mappings that we use to to connect the layers of 
ACE is that adjacent histogram bins derive from input vectors that are close to 
each other (in the Euclidean sense), so it is sensible to rebin the histogram by 
combining together adjacent bins. Thus we control the histogram bin size by 
truncating the low order bits of each binary vector that represents a pixel value. 



17 



If we do not truncate any bits, then the 2-dimensional histogram faithfully 
records the number of times that a pair of pixel values has occured. However, if 
we truncate b low order bits of each pixel value then effectively we sum together 
the histogram bins in groups of 2 2b (= 2 b x 2 b ) adjacent bins, which smooths 
the histogram. The more smoothing that we impose the less Poisson noise the 
histogram suffers. However, as we smooth the histogram we run the danger of 
smoothing away significant structure that might usefully be used to characterise 
the input image: so we need to make a compromise. In our Brodatz texture 
work we use only 4-6 bits of each pixel value to generate the histograms in 
each stage of ACE. Note that we use more bits for vector quantisation than 
for histogramming because the vector quantisation needs to be good enough to 
preserve information for encoding by later layers of the hierarchy, whereas the 
histogramming information is not passed to later layers. 

In Equation EH we need to estimate the logarithm of various probabilities 
from the histograms. We do this in two stages. Firstly, we regularise the 
histograms by placing a lower bound on the permitted number of counts. One 
possible prescription is to ensure that each histogram bin has a number of counts 
at least as large as the average number of counts in all the histogram bins (as 
determined before regularising the histogram) . Thus 



where the angle brackets (• • • ) denote an average over histogram bins, rounded 
up to the next largest integer to avoid setting histogram bins to zero. Secondly, 
we estimate the probabilities Pijk-.- ,i'j'k'---(yijk---,yi'j'k'---) by inserting the regu- 
larised histograms into Equation [2U We use a marginalised version of Equation 
l25l to estimate the marginal probabilities Pijk—{Vijk—)- Finally, we compute 
the logarithmic probabilities in Equation EH by using a table of logarithms of 
integers, up to the maximum possible number of counts that could occur in a 
histogram bin - it suffices to tabulate logarithms up to log(iV). 

The prescription in EquationE3is crude but effective. We could improve the 
performance by introducing prior knowledge of the statistical properties of the 
input data. Our histogram smoothing prescription already implicity makes use 
of prior knowledge of the properties of the Posson noise process that affects the 
histogram counts, and prior knowledge of the fact that adjacent histogram bins 
correspond to similar input vectors. Additional prior knowledge would further 
enhance the performance, especially in cases where there is a limited amount of 
training data (such as small images, or small segments of larger images). 

A pitfall that must be avoided is using histogram bins that are too small 
when one intends to train ACE on one image and then use a different image to 
generate a probability image. Effectively, the large number of small bins records 
the details of the statistical fluctuations of the training image (as particular 
realisations of a Poisson noise process in each bin), which thus acts as a detailed 
record of the structure in the training image. The histograms thus look very 
spiky, and in an extreme case there may be a counts recorded in only a few 



h%jk-- ■ ,i' j f k' ■■■ (jjijk--- j Vi' j' k' ■ ■ ■ ) 




hijk-- ■ ,i' j'k 1 ■ ■ ■ (yijk-- ■ , l/i'j'k' ■ ■ ■ ) 

(hijk- ,i'j'k'-(yijk-,yi'j'k'-)) 



h > (h) 
h < (h) 
(37) 
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bins with zeros in all of the remaining bins. If this situation occurs then the 
training image records a large log(Q mem (x)), whereas a test image having the 
same statistical properties records a small log(Q mem (a;)). Effectively, the spikes 
in the training and test image histograms are not coincident. This problem can 
be solved by choosing a large enough histogram bin size. 

Finally, we display the logarithmic probability image as follows. We deter- 
mine the range of pixel values that occurs in the image, and we translate and 
scale this into the range [0,255]. This ensures that the smallest logarithmic 
probability appears as black, and the largest logarithmic probability appears as 
white, and all other values are linearly scaled onto intermediate levels of grey. 
This prescription has its dangers because each probability image determines its 
own special scaling, so one should be careful when comparing two different prob- 
ability images. It can also be adversely affected by pixel value outliers arising 
from Poisson noise effects, where an extreme value of a single pixel could affect 
the way in which the whole of an image is displayed. However, we find that 
the overlapping tree prescription in Figure together with the backpropaga- 
tion prescription in Figure EjJ causes enough effective averaging together of the 
histogram bins that we do not encounter problems with pixel value outliers. 

In all of the images that we present below, we compensate for the uneven 
illumination by introducing a grey scale wedge as we explained earlier, we use 8 
bits per pixel for vector quantisation, we use 6 bits per pixel for histogramming, 
and we invert the [0, 255] scale to produce an anomaly image, in which a white 
pixel indicates a small (rather than a large) logarithmic probability. 

4.2 Texture 1 



Figure 8: 256 x 256 image of Brodatz fabric number 1. 

In FigureHOwe show the first Brodatz texture image that we use in our exper- 
iments. The image is slightly unevenly illuminated and has a fairly low contrast, 
but nevertheless its statistical properties are almost translation invariant. 

In Figure |2l we show the anomaly images that derive from Figure |S| Note 
how the anomaly images become smoother as we progress from Figure to 
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Figure Eh, due to the increasing amount of averaging that occurs amongst the 
overlapping backpropagated rectangular tiles that build up each image. 

Figure and especially Figure Et reveal a highly localised anomaly in the 
original image. Figure Et corresponds to a length scale of 8 x 8 pixels, which is 
the approximate size of the fault that is about \ of the way down and slightly 
to the left of centre of Figure|H| The fault does not show up clearly on the other 
figures in Figure El because their characteristic length scales are either too short 
or too long to be sensitive to the fault. 

There is a major feature in the bottom right hand corner of FigureEh, where 
the anomaly image is darker than average, indicating that the corresponding 
part of the original image has a higher than average probability. This is a 
different type of anomaly to the sort that we have envisaged so far - it occurs 
because the corresponding part of original image happens to explore only a high 
probability part of the space that is explored by the whole image. This part 
of the anomaly image is surrounded by a brighter than average border, which 
indicates a conventional anomalous region. 

From Figure EH we conclude that ACE can easily pick out localised faults in 
highly ordered textures. 
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Figure 9: 256 x 256 anomaly images of Brodatz fabric numbe 
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4.3 Texture 2 




Figure 10: 256 x 256 image of Brodatz fabric number 2. 

In Figure E3 we show the second Brodatz texture image that we use in our 
experiments. The image has a high contrast and translation invariant statistical 
properties. 

In Figure El we show the anomaly images that derive from Figure E3 The 
most interesting anomaly image is Figure fTTT' which shows several localised 
anomalies. About halfway down and to the left of centre of the image is an 
anomaly that corresponds to a dark spot on the thread in Figure The 
brightest of the anomalies in the cluster just above the centre of the image cor- 
responds to what appears to be a slightly torn thread in Figure [Till The other 
anomalies in this cluster are weaker, and correspond to slight distortions of the 
threads. There is another anomaly just below and to the right of the centre 
of Figure lll^ r. which corresponds to what appears to be another slightly torn 
thread in Figure These anomalies all occur at, or around, a length scale of 
8x8 pixels. Several of the anomaly images show an anomaly in the bottom left 
hand corner of the image, which corrsponds to a small uniform patch of fabric 
in Figure E3 

The results in Figure El corroborate the evidence in Figure El that ACE 
can be trained in an unsupervised fashion to pick out localised faults in highly 
ordered textures. 
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Figure 11: 256 x 256 anomaly images of Brodatz fabric number 2. 
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4.4 Texture 3 




Figure 12: 256 x 256 image of Brodatz fabric number 3. 

In Figure H*2l we show the third Brodatz texture image that use in our exper- 
iments. The image has a very high contrast and statistical properties that are 
almost translation invariant. However the density of anomalies is much higher 
than in either Figure El or Figure llOl 

In Figure El we show the anomaly images that derive from Figure fl2l The 
most prominant anomaly is in Figure Hl%. at a length scale of 8 x 16 pixels, 
which corresponds to region of Figure El that ls just above and to the left of 
centre of the image. This region is anomalous because it is both distorted and 
has slightly thicker threads than elsewhere. The large distorted region in the 
bottom left hand corner of FigurelT2ldoes not show up very clearly to the naked 
eye in Figure El but Figure ET and Figure Eh have significant peaks in this 
region. There are also many other localised peaks in Figure El which can be 
traced back to corresponding faults in Figure El 

Comparing Figure El with Figure El and Figure El we conclude that the 
ability of ACE to pick out faults is degraded as the density of faults increases. 
This is because the faults themselves are part of the statistical properties that 
are extracted by ACE, and if a particular fault occurs often enough in the image 
then it is no longer deemed to be a fault. 
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Figure 13: 256 x 256 anomaly images of Brodatz fabric number 3. 
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4.5 Texture 4 

In this section we present a slightly different type of experiment in which we 
train ACE on one image and test ACE on another image. To create the two 
images we start with a single 256 x 256 image of a Brodatz texture, which we 
divide into a left half and a right half. We then use the left half to build up the 
training image, and the right half to build up the test image. 




Figure 14: 256 x 256 image of Brodatz carpet for training. 
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Figure 15: 256 x 256 image of Brodatz carpet for testing. 

In Figure E] we show the training image which is a montage of two copies of 
the left hand half of a Brodatz texture image. Note that this montage contains 
only as much information as was present in the original half image from which 
it was constructed. In Figure El we show the test image which is a montage of 
two copies of the right hand half of a Brodatz texture image, and superimposed 
on that is a 64 x 64 patch which we generated by flipping the rows and columns 
of a copy of the top left hand corner of this image. This patch is a hand crafted 
anomaly. Note that in constructing these images we have scrupulously avoided 
the possibility that the training and test images could contain elements deriving 
from a common source. 
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Figure 16: 256 x 256 anomaly images of Brodatz carpet. 
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In Figure ITTfl we show the anomaly images that derive from Figure fTol after 
having trained on Figure ^| Figure llfif shows the strongest response to the 
anomalous patch in the centre of the image, corresponding to anomaly detection 
on a length scale of 8 x 8 pixels. 

5 Conclusions 

Using maximum entropy methods, we have shown how to construct maximum 
entropy estimates of PDFs by using adaptive hierarchical transformation func- 
tions to record various marginal PDFs of the data, which we call an "Adaptive 
Cluster Expansion" (ACE). This method is a member of the same family as 
the trainable MRF known as the Boltzmann Machine, but it uses sophisticated 
transformations of the input data rather than hidden variables to characterise 
the high order statistical properties of the training set. The simulations in this 
paper use hierarchical topographic mappings to build these transformations, but 
this is a convenience, not a necessity. 

We have also shown how to extend ACE so that it can be applied to transla- 
tion invariant image processing, such as the detection of statistical anomalies in 
otherwise statistically homogeneous textures. Our methods show great promise, 
not only because they are amenable to a full theoretical analysis leading to 
closed-form maximum entropy solutions, but also because they lead directly to 
a modular system design which can locate anomalies in textures. 

We have presented several examples where ACE successfully detects anoma- 
lous regions in otherwise statistically homogeneous textures. In all cases ACE 
adaptively extracts the global statistics of an image at various length scales 
during the unsupervised training, which takes 20 seconds (on a VAXStation 
3100) for the 8 layer ACE network that we applied to this problem. ACE then 
uses these statistics to form an output image that represents the probability 
that each local patch of the input image belongs to the ensemble of patches 
presented during training. We call this a "probability image". 

Some possible applications of our results are as follows. Inspection of textiles: 
this relies on the assumed statistical homogeneity of an unflawed piece of textile, 
so that faults show up as anomalies, which we have demonstrated successfully 
in this paper. Detection of targets in noisy background clutter in radar images: 
this is basically a noisy version of the textile inspection problem, which goes 
somewhat beyond what we have presented in this paper, because it needs to 
address the problem of the noise entropy saturating ACE. Texture segmentation: 
this is an ambitious goal which requires much further analysis in order to derive 
a computationally cheap method of handling multiple simultaneous textures. 

A Vector quantisation 

In this appendix we summarise the hierarchical vector quantisation method that 
we presented in detail in ^2]. In this paper we use this technique to optimise 
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the inter-layer mappings in Figured We have applied this technique elsewhere 
to image compression [201, and multilayer self-organising neural networks |101 



A.l Standard vector quantisation 

This subsection contains those details of the theory of standard vector quanti- 
sation that one needs to understand before proceeding to the modified vector 
quantisation scheme that we present in Section TA. 21 

The problem is to form a coding y of a vector x in such a way that a good 
estimate x' of x can be constructed from knowledge of y alone. The sketch 
derivation in this section is presented in greater detail in £03- Thus a vector 
quantiser is constructed by minimising a Euclidean distortion D\ with respect 
to the choice of coding function y(x) and decoding function x'(y), where 

D x = I dxP(x) \\x - a;'(y(a;))|| 2 (38) 



input code f ^reconstruction 

5E -* y 



Figure 17: Encoding and decoding in a vector quantiser. 



We may represent the encoding and decoding operations diagrammatically 
as shown in Figure El By functionally differentiating Di with respect to y{x) 
and x'{y) we obtain 



' )Di = P( X )jL\\ X >(y)- X \f 



8y(x) dy 
SDi 



(39) 



5x'(y) 

— n 



= 2 / dxP(x)S(y - y{x)) (x'(y) ~ x) (40) 
Setting g*?K — in Equation [SSI yields the optimum encoding function 



y(x) = arg \\x - x' (y(x))f (41) 

which is called "nearest neighbour encoding". Setting jfr^y = in Equation 1401 
yields the optimum decoding function 

, J dxP(x)S(y — y{x))x 
X {y) = J dxP(x)S(y - y(x)) (42) 
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which is the update scheme derived in [21]. Alternatively, we may use an in- 
cremental scheme to optimise the decoding function by following the path of 
steepest descent which we may obtain from Equation BQ1 as 

5x'(y) = eS(y — y(x)) (x — x'(y)) whereO < e < 1. (43) 

An iterative optimisation scheme may be formed by alternately applying Equa- 
tion^land then either Equation^Jor Equation^] This scheme will alternately 
improve the encoding and decoding functions until a local minimum distortion 
is located. Alternating Equation ^] and Equation El is commonly called the 
"LBG" (after the authors of pT] l or "k-means" algorithm. 

A. 2 Noisy vector quantisation 

This subsection contains the theoretical details of the optimisation of inter- 
layer mappings that we use in our numerical simulations in Section 0J Thus we 
generalise the results of Section IA.1I to the case where the encoded version of 
the input vector is distorted by a noise process {2211231 1131 114j . 
Define a modified Euclidean distortion D 2 as 

D 2 = [ dxP(x) [ dnir(n) \\x-x'(y(x) + n)\f (44) 
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Figure 18: Encoding and decoding in a noisy vector quantiser. 

We may represent the encoding and decoding operations together with the 
noise process diagrammatically as shown in Figure which is a trivially mod- 
ified version of Figure El By functionally differentiating D 2 with respect to 
y(x) and x'{y) we obtain 



,)D ' = P(x) f dmr(n)-^- \\x'{y) - x\\ 2 



5y(x) J dy 

SD 



(45) 

y=y(x)+n 



5x'(y) 



= 2 / dxP(x)ir{y - y(x)) {x'{y) - x) (46) 



Equation 03 is a "smeared" version of Equation 020 so Sy ^ = does not lead 
to nearest neighbour encoding because the distances to other code vectors have 
to be taken into account in order to minimise the damaging effect of the noise 
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process. However, it is usually a good approximation to use the nearest neigh- 
bour encoding scheme shown in Equation El Setting g§77^ = in Equation 1461 
yields the optimum decoding function 

,. , f dxP(x)ir(y — y(x))x , . 

J dxP{x)ir(y - y(x)) 

which should be compared with Equation El Alternatively, we may obtain a 
steepest descent scheme in the form 

Sx'(y) = eir{y — y{x)) (x — x'(y)) whereO < e < 1 (48) 

which should be compared with Equation 021 

As in Section lA~Tl iterative optimisation schemes can be constructed in which 
we alternate the optimisation of the coding and decoding functions. Alternating 
Equationl41l ( which approximately solves gjp?^ = 0) and Equation yields the 
standard topographic mapping training algorithm |||, which is widely used in 
various forms in neural network simulations. 



A. 3 Hierarchical vector quantisation 

In Figure El we show the simplest type of hierarchical vector quantiser. It 
consists of an inner quantiser contained in the dashed box, surrounded by a 
pair of outer quantisers. If the part of the diagram contained in the dashed box 




Figure 19: Encoding and decoding in a hierarchical vector quantiser. 

were removed and direct connections made so that y[ = 2/1 and y' 2 = 2/2, then 
Figure El would reduce to a pair of independent vector quantisers of the type 
shown in Figure El The dashed box contains a vector quantiser which encodes 
(2/1,2/2) to produce a code which it subsequently decodes to obtain (2/1,2/2)- 

From the point of view of 2/1 the effect of being passed through the inner 
quantiser is to modify y\ thus 2/1 — * y[- A similar argument applies to 2/2 — ► 
2/2 . The actual distortions y[ — 2/1 and y' 2 — 2/2 will be correlated in practice, 
but we shall model them as if they were independent processes, and thus reduce 
Figure El to two independent vector quantisers of the type shown in Figure El 
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This procedure can be extended to a hierarchical vector quantiser with any 
number of levels of nesting. From the point of view of the quantisers at any level, 
we shall model the effect of the quantisers inwards from that level as independent 
distortion processes. It turns out not to be critically important what precise 
distortion model one uses, provided that it approximately represents the overall 
scale of the distortion due to quantisation. 

In we presented in detail a phenomenological distortion model that we 
used to obtain an efficient training procedure for topographic mappings and 
their application to hierarchical vector quantisers. Alternatively, the standard 
topographic mapping training procedure in [§] could be used, but this is a 
rather inefficient algorithm. The basic training procedure may be obtained 
from Equation 1481 as 

1. Select a training vector x at random from the training set. 

2. Encode x to produce y (= y{x)). 

3. For all y' do the following: 

(a) Determine the corresponding code vector x'{y'). 

(b) Move the code vector x'(y') directly towards the input vector x by 
a distance eir(y — y(x))\x — x'(y)\. 

4. Go to stepQ] 

This cycle is repeated as often as is required to ensure convergence of the code- 
book of code vectors. 

The standard method j|| specifies that n(y' — y) should be an even unimodal 
function whose width should be gradually decreased as training progresses. This 
allows coarse-grained organisation of the codebook to occur, followed progres- 
sively by ever more fine-grained organisation, until finally the algorithm con- 
verges towards an optimum codebook. 

In our own modification of the standard method we replace a shrinking 
7r(y' — y) function acting on a fixed number of code vectors by a fixed 7t(y' — y) 
function acting on an increasing number of code vectors. There are many minor 
variations on this theme, but we find that it is sufficient to define 



where we have absorbed e in Equation0Hlinto the definition of n(y' — y). We use 
a binary sequence of codebook sizes N = 2, 4, 8, 16, 32, • • ■ , where each codebook 
is initialised by interpolation from the next smaller codebook. We find that the 
following parameter values yield adequate convergence: e = 0.1, e' = 0.05, and 
we perform 20iV training updates before doubling the value of N and progressing 
to the next larger size of codebook. The N — 2 codebook can be initialised using 
a random pair of vectors from the training set. 




y' -i/| = i 
y'-y\ >i 



y = y 



(49) 
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B Relationship to WISARD 




Figure 20: Single layer ACE is WISARD. 

The second bracketed term in Ecma,tionl2Hcould be implemented in hardware 
as shown in Figure [23 This implementation assumes that the state vector x 
is quantised by representing each of its components using a finite number of 
binary digits (bits). Note that we have taken advantage of the fact that we 
are discussing a single layer network in order to simplify the notation in Figure 
EDI (as compared with Equation E3|. The z-th block in this circuit is a random 
access memory (RAM) which records a transformation from Xi (the address of 
an entry in the RAM) to log Pi(xi) (the corresponding entry in the RAM). The 
data bus at the bottom of Figure [201 carries the components of x (represented 
bitwise) to the relevant RAM. Note that each bit of x is used exactly once in 
forming addresses for the RAM, so the mapping from x to the set of addresses 
is bijective. The upper part of Figure EQl shows how the outputs are directed to 
an accumulator where they are summed to form log<5 mem (^)- 

Figurel2lHis a variant of the WISARD pattern recognition network P3|- The 
elements that our MEM solution and WISARD have in common are: a bijective 
mapping from the bits of an input state vector onto the address lines of a set of 
RAMs, and the accumulation of the outputs of the RAMs to form the overall 
network output. 

However, there are some differences between the single layer ACE and the 
WISARD prescriptions for the contents of the RAMs. ACE specifies a set of 
functions (i.e. logarithms of marginal probabilities) to tabulate in the RAMs. 
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Suppose that we truncate these table entries to a 1-bit representation, so that 
we use to represent small logarithmic probabilities and 1 to represent large log- 
arithmic probabilities. Each entry (i.e. 1 or 0) in the table then records whether 
the configuration of binary digits (i.e. the address of the entry) frequently oc- 
curs in the set of patterns corresponding to P{x) (the "training set"). The final 
output is therefore the total number of l's that the input pattern addresses in 
the n tables. In effect, this is the total number of coincidences between config- 
urations of bits in the input pattern and those in a predefined category. This 
1-bit version of ACE is qualitatively the same as the table look-up and summa- 
tion operations performed in the simplest WISARD network, which completes 
the connection that we sought between ACE and WISARD. 
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